NSF PAR Search | NSF Public Access Repository

Image and video tokenization with binary spherical quantization

Zhao, Yue; Xiong, Yuanjun; Krähenbühl, Philipp (April 2025, ICLR)

Free, publicly-accessible full text available April 24, 2026
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Zhao, Yue; Xue, Fuzhao; Reed, Scott; Fan, Linxi; Zhu, Yuke; Kautz, Jan; Yu, Zhiding; Krähenbühl, Philipp; Huang, De-An (February 2025, cs.CV)

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
Free, publicly-accessible full text available February 7, 2026
Distilling Structural Representations into Protein Sequence Models

https://doi.org/10.1101/2024.11.08.622579

Ouyang-Zhang, Jeffrey; Gong, Chengyue; Zhao, Yue; Krähenbühl, Philipp; Klivans, Adam R; Diaz, Daniel J (November 2024, bioRxiv)

Protein language models, like the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on downstream biological tasks. Representations based on sequence and structure models, however, show significant performance differences depending on the downstream task. A major open problem is to obtain representations that best capture both the evolutionary and structural properties of proteins in general. Here we introduce Implicit Structure Model (ISM), a sequence-only input model with structurally-enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2’s pre-trained model. We have made ISM’s structure-enriched weights easily available: integrating ISM into any application using ESM2 requires changing only a single line of code. Our code is available at https://github.com/jozhang97/ISM.
Full Text Available
Image and Video Tokenization with Binary Spherical Quantization

Zhao, Yue; Xiong, Yuanjun; Krähenbühl, Philipp (June 2024, https://doi.org/10.48550/arXiv.2406.07548)

This work introduces a transformer-based image and video tokenizer leveraging Binary Spherical Quantization (BSQ). The method projects high-dimensional visual embeddings onto a lower-dimensional hypersphere followed by binary quantization. BSQ offers three key benefits: (1) parameter efficiency without requiring an explicit codebook, (2) scalability to arbitrary token dimensions, and (3) high compression capability—up to 100× compression of visual data with minimal distortion. The tokenizer architecture includes a transformer encoder-decoder with block-wise causal masking to handle variable-length video inputs. The resulting model, BSQ-ViT, achieves state-of-the-art visual reconstruction performance on image and video benchmarks while delivering 2.4× higher throughput compared to previous best methods. Additionally, BSQ-ViT supports video compression via autoregressive priors for adaptive arithmetic coding, achieving results comparable to leading video compression standards. Furthermore, it enables masked language models to achieve competitive image synthesis quality relative to GAN- and diffusion-based approaches.
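The quantization step described above (project to a lower-dimensional space, normalize onto a hypersphere, then binarize) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `bsq_quantize`, the random projection matrix `proj` (which would be learned in practice), the 1/sqrt(L) scaling that keeps the quantized vector on the unit sphere, and the bit-packing into token ids are all assumptions made for the example.

```python
import numpy as np

def bsq_quantize(z, proj, eps=1e-12):
    """Sketch of binary spherical quantization (hypothetical helper).

    z:    (n, d) high-dimensional visual embeddings
    proj: (d, L) linear projection to L dimensions (assumed; learned in practice)
    """
    v = z @ proj                                   # project to L dimensions
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    u = v / np.maximum(norm, eps)                  # place on the unit hypersphere
    bits = u >= 0                                  # binary quantization by sign
    L = u.shape[-1]
    # Assumed scaling: each coordinate is +/- 1/sqrt(L), so u_hat has unit norm.
    u_hat = np.where(bits, 1.0, -1.0) / np.sqrt(L)
    # The L sign bits index an implicit codebook of 2**L entries,
    # so no explicit codebook table needs to be stored.
    token_ids = bits.astype(np.int64) @ (1 << np.arange(L, dtype=np.int64))
    return u_hat, token_ids

# Example usage with random data standing in for encoder outputs.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
proj = rng.normal(size=(16, 8))
u_hat, token_ids = bsq_quantize(z, proj)
```

Because each token is fully determined by its L sign bits, the sketch needs no stored codebook, which is one way to read the abstract's "parameter efficiency without requiring an explicit codebook".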
Full Text Available
Distilling Vision-Language Models on Millions of Videos

https://doi.org/10.1109/CVPR52733.2024.01245

Zhao, Yue; Zhao, Long; Zhou, Xingyi; Wu, Jialin; Chu, Chun-Te; Miao, Hui; Schroff, Florian; Adam, Hartwig; Liu, Ting; Gong, Boqing; et al (June 2024, CVPR)

Full Text Available
PartDistillation: Learning Parts from Instance Segmentation

https://doi.org/10.1109/CVPR52729.2023.00691

Cho, Jang Hyun; Krähenbühl, Philipp; Ramanathan, Vignesh (June 2023, IEEE)

Full Text Available
Learning Video Representations from Large Language Models

https://doi.org/10.1109/CVPR52729.2023.00637

Zhao, Yue; Misra, Ishan; Krähenbühl, Philipp; Girdhar, Rohit (June 2023, IEEE CVPR)

Full Text Available
Long-tail Detection with Effective Class-Margins

Cho, Jang Hyun; Krähenbühl, Philipp (October 2022, European Conference on Computer Vision)

Large-scale object detection and instance segmentation face a severe data imbalance. The finer-grained object classes become, the less frequently they appear in our datasets. However, at test-time, we expect a detector that performs well for all classes and not just the most frequent ones. In this paper, we provide a theoretical understanding of the long-tail detection problem. We show how the commonly used mean average precision evaluation metric on an unknown test set is bounded by a margin-based binary classification error on a long-tailed object detection training set. We optimize margin-based binary classification error with a novel surrogate objective called Effective Class-Margin Loss (ECM). The ECM loss is simple, theoretically well-motivated, and outperforms other heuristic counterparts on the LVIS v1 benchmark over a wide range of architectures and detectors.
Full Text Available
Detecting twenty-thousand classes using image-level supervision

Zhou, Xingyi; Girdhar, Rohit; Joulin, Armand; Krähenbühl, Philipp; Misra, Ishan (October 2022, European Conference on Computer Vision)

Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not need complex assignment schemes to assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic obtains 41.7 mAP when evaluated on all classes, or only rare classes, hence closing the gap in performance for object categories with few samples. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning.
Full Text Available
Real-Time Online Video Detection with Temporal Smoothing Transformers

Zhao, Yue; Krähenbühl, Philipp (January 2022, ECCV 2022)

Full Text Available